We consider the adversarial convex bandit problem and we build the first $\mathrm{poly}(T)$-time algorithm with $\mathrm{poly}(n) \sqrt{T}$-regret for this problem. To do so we introduce three new ideas in the derivative-free optimization literature: (i) kernel methods, (ii) a generalization of Bernoulli convolutions, and (iii) a new annealing schedule for exponential weights (with increasing learning rate). The basic version of our algorithm achieves $\tilde{O}(n^{9.5} \sqrt{T})$-regret, and we show that a simple variant of this algorithm can be run in $\mathrm{poly}(n \log(T))$-time per step at the cost of an additional $\mathrm{poly}(n) T^{o(1)}$ factor in the regret. These results improve upon the $\tilde{O}(n^{11} \sqrt{T})$-regret and $\exp(\mathrm{poly}(T))$-time result of the first two authors, and the $\log(T)^{\mathrm{poly}(n)} \sqrt{T}$-regret and $\log(T)^{\mathrm{poly}(n)}$-time result of Hazan and Li. Furthermore we conjecture that another variant of the algorithm could achieve $\tilde{O}(n^{1.5} \sqrt{T})$-regret, and moreover that this regret is unimprovable (the current best lower bound being $\Omega(n \sqrt{T})$, and it is achieved with linear functions). For the simpler situation of zeroth order stochastic convex optimization this corresponds to the conjecture that the optimal query complexity is of order $n^3 / \epsilon^2$.
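To make idea (iii) concrete, here is a minimal illustrative sketch of exponential weights with an increasing learning rate, on a finite set of experts standing in for a discretization of the convex domain. The geometric schedule `eta *= growth`, the random losses, and all parameter values are illustrative assumptions, not the paper's actual construction (which works over a continuous domain via kernels).

```python
import math
import random

def exp_weights_increasing_lr(losses, eta0=0.1, growth=1.05):
    """Exponential weights over n experts with an increasing learning rate.

    losses: list over rounds; each round is a list of per-expert losses in [0, 1].
    Returns the algorithm's regret against the best fixed expert.
    """
    n = len(losses[0])
    cum = [0.0] * n          # cumulative loss of each expert
    eta = eta0
    total = 0.0              # algorithm's expected loss
    for round_losses in losses:
        # Recompute weights each round with the current (larger) eta;
        # shifting by min(cum) leaves the distribution unchanged but
        # avoids exp() underflow as eta grows.
        mn = min(cum)
        w = [math.exp(-eta * (c - mn)) for c in cum]
        z = sum(w)
        p = [wi / z for wi in w]
        total += sum(pi * li for pi, li in zip(p, round_losses))
        cum = [c + l for c, l in zip(cum, round_losses)]
        eta *= growth        # "annealing": the learning rate increases over time
    return total - min(cum)

random.seed(0)
T, n = 200, 5
losses = [[random.random() for _ in range(n)] for _ in range(T)]
print(exp_weights_increasing_lr(losses))
```

As `eta` grows, the distribution `p` concentrates ever more sharply on the current leader, which is the qualitative effect the annealing schedule is after.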